CAPSTONE PROJECT¶

Loan Default Prediction¶

Background¶

A major proportion of retail bank profit comes from interest on home loans, which are typically borrowed by regular-income or high-earning customers. Banks are most fearful of defaulters, as bad loans (non-performing assets, NPAs) usually eat up a major chunk of their profits. It is therefore important for banks to be judicious when approving loans for their customer base. The loan approval process is multifaceted: the bank checks the creditworthiness of the applicant through a manual study of various aspects of the application. This process is not only effort-intensive but also prone to wrong judgments and approvals owing to human error and bias. Many banks have attempted to automate the process using heuristics, but with the advent of data science and machine learning, the focus has shifted to building models that can learn the approval process, making it more efficient and free of bias. At the same time, it is important to ensure that the model does not learn the biases that previously crept in through the human approval process.

Problem Statement¶

A bank's consumer credit department aims to simplify the decision-making process for home equity lines of credit to be accepted. To do this, they will adopt the Equal Credit Opportunity Act's guidelines to establish an empirically derived and statistically sound model for credit scoring. The model will be based on the data obtained via the existing loan underwriting process from recent applicants who have been given credit. The model will be built from predictive modeling techniques, but the model created must be interpretable enough to provide a justification for any adverse behavior (rejections).

Objective¶

Build a classification model to predict clients who are likely to default on their loan and give recommendations to the bank on the important features to consider while approving a loan.

Data Dictionary¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. There are 12 input variables registered for each applicant.

● BAD: 1 = Client defaulted on loan, 0 = loan repaid
● LOAN: Amount of loan approved
● MORTDUE: Amount due on the existing mortgage
● VALUE: Current value of the property
● REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts)
● JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
● YOJ: Years at present job
● DEROG: Number of major derogatory reports (which indicates serious delinquency or late payments).
● DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due)
● CLAGE: Age of the oldest credit line in months
● NINQ: Number of recent credit inquiries
● CLNO: Number of existing credit lines
● DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income. This number is one of the ways lenders measure a borrower’s ability to manage the monthly payments to repay the money they plan to borrow)
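As a quick illustration of the DEBTINC definition above, the ratio can be computed directly; the function name and figures below are hypothetical and not drawn from the dataset.

```python
# Hypothetical example of the debt-to-income (DTI) calculation defined above.
# The monthly figures are illustrative, not taken from the HMEQ data.
def debt_to_income(monthly_debt_payments, gross_monthly_income):
    """DTI expressed as a percentage of gross monthly income."""
    return 100.0 * monthly_debt_payments / gross_monthly_income

# An applicant paying 1,700 per month on debts out of a 5,000 gross income:
print(debt_to_income(1700, 5000))  # 34.0
```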

DATASET OVERVIEW¶

In [1]:
#Using tqdm to show progress bar
! pip install tqdm
Requirement already satisfied: tqdm in c:\users\sheidu omuya yusuf\anaconda3\lib\site-packages (4.65.0)
Requirement already satisfied: colorama in c:\users\sheidu omuya yusuf\anaconda3\lib\site-packages (from tqdm) (0.4.6)

Import Necessary Libraries

In [2]:
# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set()

# split the data into train and test
from sklearn.model_selection import train_test_split

# to build models (e.g., logistic regression) using statsmodels
import statsmodels.api as sm

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error

#To ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

from statsmodels.tools.sm_exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn import linear_model,svm
from imblearn.over_sampling import SMOTE
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import sklearn.linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, train_test_split

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore", ConvergenceWarning)
In [3]:
# this will help in making the Python code more structured automatically (help adhere to good coding practices)
#%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV
In [4]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
In [5]:
from sklearn.metrics import make_scorer
In [6]:
# Install scikit-learn via conda (the ! prefix runs this as a shell command; -y skips the confirmation prompt)
!conda install -y -c conda-forge scikit-learn
Note: you may need to restart the kernel to use updated packages.
In [7]:
# To get different metric scores
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

from sklearn.metrics import confusion_matrix


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
In [8]:
#Access and read the dataset
loan_predict=pd.read_csv("hmeq.csv")
In [9]:
#Make a copy of the original dataset
df=loan_predict.copy()
In [10]:
#view top 5 rows of the dataset
df.head()
Out[10]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.00000 39025.00000 HomeImp Other 10.50000 0.00000 0.00000 94.36667 1.00000 9.00000 NaN
1 1 1300 70053.00000 68400.00000 HomeImp Other 7.00000 0.00000 2.00000 121.83333 0.00000 14.00000 NaN
2 1 1500 13500.00000 16700.00000 HomeImp Other 4.00000 0.00000 0.00000 149.46667 1.00000 10.00000 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.00000 112000.00000 HomeImp Office 3.00000 0.00000 0.00000 93.33333 0.00000 14.00000 NaN
In [11]:
#view last 5 rows of the dataset
df.tail()
Out[11]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.00000 90185.00000 DebtCon Other 16.00000 0.00000 0.00000 221.80872 0.00000 16.00000 36.11235
5956 0 89000 54576.00000 92937.00000 DebtCon Other 16.00000 0.00000 0.00000 208.69207 0.00000 15.00000 35.85997
5957 0 89200 54045.00000 92924.00000 DebtCon Other 15.00000 0.00000 0.00000 212.27970 0.00000 15.00000 35.55659
5958 0 89800 50370.00000 91861.00000 DebtCon Other 14.00000 0.00000 0.00000 213.89271 0.00000 16.00000 34.34088
5959 0 89900 48811.00000 88934.00000 DebtCon Other 15.00000 0.00000 0.00000 219.60100 0.00000 16.00000 34.57152
In [12]:
print('The number of rows (observations) is', df.shape[0],'\n''The number of columns (variables) is',df.shape[1])
from tqdm import tqdm
for i in tqdm (range (100), desc="Loading..."):
  pass
The number of rows (observations) is 5960 
The number of columns (variables) is 13
Loading...: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<?, ?it/s]
In [13]:
# Understand the shape of the data
df.shape
Out[13]:
(5960, 13)
  • The dataset has 5960 rows and 13 columns.
In [14]:
# To print the essential information about the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
  • BAD and LOAN are integers, while REASON and JOB are objects.
  • The remaining variables in the dataset are floats.
In [15]:
#Count of number of client that repaid /defaulted
df["BAD"].value_counts()
Out[15]:
0    4771
1    1189
Name: BAD, dtype: int64
In [16]:
target = df['BAD'].value_counts()
#labels = ['No default', 'Defaulted']
#sizes = [80, 20]  # Percentages (e.g., 80% No default, 20% Defaulted)

# Colors for the slices
colors = ['#37FD12', 'red']

# Labels the features
mylabels = ['No default', 'Defaulted']

# Create the 3D pie chart
plt.pie(target, colors=colors, labels=mylabels, explode=[0, 0.1], autopct='%1.1f%%', startangle=0, labeldistance=1.2, pctdistance=0.6, shadow=True)

# Add a title
plt.title('Loan Default Status')

# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures a circular pie chart
plt.show()
  • Out of the 5960 observations, 1189 loan applicants have defaulted on their obligation while 4771 repaid.
  • There is a 19.9% default rate.
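The ~80/20 split above means the BAD classes are imbalanced, which is why SMOTE was imported earlier. SMOTE synthesizes new minority-class points by interpolating between neighbours; the numpy-only sketch below uses simple random duplication as a stand-in, just to show the effect of oversampling on class counts. The target here is synthetic, not the dataset's BAD column, and oversampling would only be applied to the training split.

```python
# Sketch: rebalancing an ~80/20 binary target by oversampling the minority class.
# (SMOTE interpolates between minority neighbours; plain duplication is used here
# only to illustrate the resulting class balance on toy data.)
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.2).astype(int)          # ~20% minority, like BAD

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
y_balanced = np.concatenate([y, y[extra]])

print((y_balanced == 1).sum() == (y_balanced == 0).sum())  # classes now balanced
```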
In [17]:
#Statistical summary of the dataset
df.describe(include = 'number').T.round(2)
Out[17]:
count mean std min 25% 50% 75% max
BAD 5960.00000 0.20000 0.40000 0.00000 0.00000 0.00000 0.00000 1.00000
LOAN 5960.00000 18607.97000 11207.48000 1100.00000 11100.00000 16300.00000 23300.00000 89900.00000
MORTDUE 5442.00000 73760.82000 44457.61000 2063.00000 46276.00000 65019.00000 91488.00000 399550.00000
VALUE 5848.00000 101776.05000 57385.78000 8000.00000 66075.50000 89235.50000 119824.25000 855909.00000
YOJ 5445.00000 8.92000 7.57000 0.00000 3.00000 7.00000 13.00000 41.00000
DEROG 5252.00000 0.25000 0.85000 0.00000 0.00000 0.00000 0.00000 10.00000
DELINQ 5380.00000 0.45000 1.13000 0.00000 0.00000 0.00000 0.00000 15.00000
CLAGE 5652.00000 179.77000 85.81000 0.00000 115.12000 173.47000 231.56000 1168.23000
NINQ 5450.00000 1.19000 1.73000 0.00000 0.00000 1.00000 2.00000 17.00000
CLNO 5738.00000 21.30000 10.14000 0.00000 15.00000 20.00000 26.00000 71.00000
DEBTINC 4693.00000 33.78000 8.60000 0.52000 29.14000 34.82000 39.00000 203.31000
  • The average loan amount approved is ~18,608, while the maximum loan amount approved is 89,900.
  • The average amount due on the existing mortgage is ~73,761.
  • The average current property value stands at 101,776, while the maximum value is 855,909.
  • The average debt-to-income ratio stands at 33.78, which is well within a favourable range.
  • The average number of existing credit lines stands at 21.
In [18]:
df.describe(include = 'object').T
Out[18]:
count unique top freq
REASON 5708 2 DebtCon 3928
JOB 5681 6 Other 2388
  • Clients applied for loans for two reasons and hold six different job types; debt consolidation appears most frequently.
In [19]:
#find missing values
df.isnull().sum()
Out[19]:
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
  • There are missing values in the dataset
In [20]:
# check for duplicated data
df[df.duplicated()].count()
Out[20]:
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
  • There are no duplicate values in the dataset
In [21]:
plt.figure(figsize = (12,8))
sns.heatmap(df.isnull(), cbar = False, cmap = 'coolwarm', yticklabels = False)
plt.show() #
  • Only BAD and LOAN have no missing values; all other columns have missing values, with DEBTINC having the highest share of missing data (~21%).
In [22]:
df.isnull().sum().sort_values(ascending = False)/df.index.size
Out[22]:
DEBTINC   0.21258
DEROG     0.11879
DELINQ    0.09732
MORTDUE   0.08691
YOJ       0.08641
NINQ      0.08557
CLAGE     0.05168
JOB       0.04681
REASON    0.04228
CLNO      0.03725
VALUE     0.01879
BAD       0.00000
LOAN      0.00000
dtype: float64

I. EXPLORATORY DATA ANALYSIS¶

I.1. Univariate Analysis¶

In [23]:
def univar_vis(fd):  # Define univariate visualization function
    title = fd.name
    fig, axes = plt.subplots(2, 2, figsize=(10, 6))
    fig.suptitle(title.upper() + "   " + "Distribution")
    sns.kdeplot(fd, color="green", ax=axes[0, 0])  # distplot is deprecated; kdeplot gives the same KDE-only view
    sns.boxplot(x=fd, ax=axes[0, 1])
    sns.violinplot(x=fd, ax=axes[1, 0])
    sns.histplot(fd, ax=axes[1, 1])

    axes[0, 0].axvline(fd.mean(), color="black", linewidth=0.7)
    axes[0, 0].axvline(fd.median(), color="red", linewidth=0.3)
    axes[0, 1].axvline(fd.median(), color="red", linewidth=0.9)
    axes[0, 1].axvline(fd.mean(), color="purple", linewidth=0.7)
    axes[1, 0].axvline(fd.mean(), color="purple", linewidth=0.7)
    axes[1, 0].axvline(fd.median(), color="green", linewidth=0.7)
    axes[1, 1].axvline(fd.mean(), color="purple", linewidth=0.7)
    axes[1, 1].axvline(fd.median(), color="green", linewidth=0.7)
    plt.tight_layout()
    plt.show()
In [24]:
univar_vis(df["DEBTINC"])
In [25]:
univar_vis(df["DEROG"])
In [26]:
univar_vis(df["DELINQ"])
In [27]:
univar_vis(df["MORTDUE"])
In [28]:
univar_vis(df["YOJ"])
In [29]:
univar_vis(df["NINQ"])
In [30]:
univar_vis(df["CLAGE"])
In [31]:
univar_vis(df["CLNO"])
In [32]:
univar_vis(df["VALUE"])
In [33]:
univar_vis(df["BAD"])
In [34]:
univar_vis(df["LOAN"])
  • LOAN, MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO and DEBTINC are all not normally distributed.
  • They are heavily skewed to the right, indicating the presence of outliers.
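The right skew noted above can be quantified with pandas' `skew()`, which returns positive values for right-skewed columns. The sketch below uses synthetic lognormal data as a stand-in; in the notebook, `df[num_col].skew()` would give the per-column figures directly.

```python
# Quantifying right skew: .skew() is positive for right-skewed data.
# Synthetic lognormal values stand in for columns like LOAN or VALUE.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=10, sigma=0.6, size=5000), name="LOAN_like")
print(sample.skew() > 1)  # strongly right-skewed
```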

I.2. Bivariate Analysis¶

In [35]:
# Display columns
df.columns
Out[35]:
Index(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'],
      dtype='object')
In [36]:
num_col = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

cat_coln = [ 'REASON', 'JOB',]
In [37]:
sns.countplot(x ='JOB', data=df, hue='BAD')  #Job vs default /repaid


plt.show()
  • Clients who listed "Other" as their job type form the largest group of borrowers and also account for the highest number of defaulters.
  • Sales and self-employed clients represent the smallest groups of borrowers.
In [38]:
sns.countplot(x ='REASON', data=df, hue='BAD') # Reason for loan vs default/repaid


plt.show()
  • Clients who took loans for debt consolidation are the most numerous and also account for the highest number of repaid loans.
In [39]:
fig,axes = plt.subplots(5,2,figsize=(12,15))
for idx,cat_col in enumerate(num_col):
    row,col = idx//2,idx%2
    sns.boxplot(y=cat_col,data=df,x='BAD',ax=axes[row,col])


plt.subplots_adjust(hspace=1)
  • Clients with relatively high loan amounts tended to repay.
  • More clients with a high amount due on the existing mortgage defaulted.
  • More clients with a high current property value appear to have defaulted.
  • The more derogatory remarks a client has, the higher the possibility of default.
  • Likewise, the higher the number of delinquent credit lines, the higher the chance of default.
  • As the number of credit inquiries increases, the possibility of default also rises slightly.
  • The more existing credit lines a client has, the greater the chance of default.
  • Most obviously, the higher a client's debt-to-income ratio, the higher the chance of default.
In [40]:
# pair plot showing relationships among variables
sns.pairplot(data=df, hue='BAD', corner=True)
from time import sleep
from tqdm import tqdm
for i in tqdm (range (10)):
  sleep(3)
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00,  3.00s/it]
In [41]:
# heatmap
plt.figure(figsize=(12,12))
sns.heatmap(df[num_col].corr(),annot=True)
Out[41]:
<Axes: >
  • MORTDUE is highly correlated with VALUE, i.e., the amount due on the existing mortgage is highly correlated with the current value of the property.
In [42]:
dfl = df.groupby(["JOB"])[["LOAN"]].mean()
dfl
Out[42]:
LOAN
JOB
Mgr 19155.28031
Office 18142.61603
Other 18061.68342
ProfExe 18983.46395
Sales 14913.76147
Self 28314.50777
  • Self-employed clients received the highest loan amounts on average.
In [43]:
ax = sns.catplot(
    x="JOB", y="LOAN", data=df, kind="bar", height=4.5, aspect=3
)

ax.set_xticklabels(rotation=90).set(
    title="JOB TYPE vs. AVERAGE LOAN AMOUNT");
  • Self-employed clients received the highest loan amounts on average.
  • Clients who are sales professionals received the least.
In [44]:
dfl2= df.groupby(["REASON"])[["LOAN"]].mean()
dfl2
Out[44]:
LOAN
REASON
DebtCon 19952.95316
HomeImp 16006.62921
  • Clients who took loans for debt consolidation borrowed more on average.
In [45]:
ax = sns.catplot(
    x="REASON", y="LOAN", data=df, kind="bar", height=4.5, aspect=3
)

ax.set_xticklabels(rotation=0).set(
    title="REASON FOR LOAN  WITH AVERAGE LOAN AMOUNT");
  • Clients who took loans for debt consolidation borrowed more on average.
In [46]:
ax = sns.catplot(
    x="REASON", y="MORTDUE", data=df, kind="bar", height=4.5, aspect=3
)

ax.set_xticklabels(rotation=0).set(
    title="REASON FOR LOAN  vs AMOUNT DUE ON EXISTING MORTGAGE");
  • On average, the amount due on the existing mortgage does not change much irrespective of the reason for the loan.
In [47]:
ax = sns.catplot(
    x="JOB", y="MORTDUE", data=df, kind="bar", height=4.5, aspect=3
)

ax.set_xticklabels(rotation=0).set(
   title="JOB TYPE  vs. AMOUNT DUE ON EXISTING MORTGAGE");
  • Self-employed clients have the highest average amount due on existing mortgages; clients in the "Other" job category have the least.
In [48]:
ax = sns.catplot(
    x="JOB", y="DELINQ", data=df, kind="bar", height=4.5, aspect=3
)  # Applicant's job type vs average number of delinquent credit line

ax.set_xticklabels(rotation=0).set(
    title="LOAN APPLICANT'S JOB WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
In [49]:
ax = sns.catplot(
    x="REASON", y="DELINQ", data=df, kind="bar", height=4.5, aspect=3
)

ax.set_xticklabels(rotation=90).set(
    title="REASON FOR LOAN WITH AVERAGE NUMBER OF DELINQUENT CREDIT LINES");
In [50]:
ax = sns.catplot(
    x="JOB", y="DEBTINC", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
    title="LOAN APPLICANT'S JOB WITH AVERAGE DEBT-TO-INCOME RATIO");
In [51]:
ax = sns.catplot(
    x="REASON", y="DEBTINC", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
    title="REASON FOR LOAN WITH AVERAGE DEBT-TO-INCOME RATIO");
In [52]:
ax = sns.catplot(
    x="REASON", y="CLNO", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
    title="REASON FOR LOAN WITH AVERAGE NUMBER OF EXISTING CREDIT LINES");
In [53]:
ax = sns.catplot(
    x="JOB", y="CLNO", data=df, kind="bar", height=4.5, aspect=3
)
ax.set_xticklabels(rotation=90).set(
    title="LOAN APPLICANT'S JOB WITH AVERAGE NUMBER OF EXISTING CREDIT LINES");
  • Clients who are sales professionals have the highest average number of existing credit lines

II. DATA PRE PROCESSING¶

II.1. Data Cleaning and Transformation¶

In [54]:
Defaults = df.copy()
In [55]:
print(Defaults["REASON"].value_counts())
print(Defaults["JOB"].value_counts())
DebtCon    3928
HomeImp    1780
Name: REASON, dtype: int64
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: JOB, dtype: int64

The scale of each attribute is different, so we need to normalize all the features. Some attributes have skewed distributions and some have many outliers (DEBTINC, LOAN, MORTDUE, VALUE). For normalization we could use the Min-Max scaler, but since attributes like LOAN, MORTDUE and VALUE have many outliers (see the boxplots), we will also try Z-score normalization (preferred). To fix the skewness, we need to transform the attributes; our basic transformations did improve the distributions of some attributes such as LOAN, MORTDUE and VALUE.

Numerosity reduction: apart from the steps above, many tuples/observations may have missing values in several of their attributes. We can consider dropping them to improve data quality; for this we need to choose a threshold so that quality improves without losing too much data. Feature reduction: dropping columns that hold the same value for most observations (DELINQ and DEROG), and reviewing REASON and JOB after considering their correlation and Predictive Power Score.

We plot a heatmap of the correlation matrix to understand the type of linear relation between attributes, and will plot it again after cleaning and transforming them.

In [56]:
Defaults["PROBINC"] = Defaults.MORTDUE/Defaults.DEBTINC # adding a new feature, (current debt on mortgage)/(debt-to-income ratio); this feature helps to evaluate the applicant's financial stability
In [57]:
from scipy.stats import yeojohnson
Defaults_temp = Defaults.copy()
Defaults_temp["LOAN"] = yeojohnson(Defaults["LOAN"])[0]          # transforming LOAN using yeo-johnson method
Defaults1 = Defaults_temp.copy()
Defaults_temp["MORTDUE"] = np.power(Defaults["MORTDUE"],1/8)     # transforming MORTDUE by raising it to 1/8
Defaults_temp["YOJ"] = np.log(Defaults["YOJ"]+10)
Defaults_temp["VALUE"] = np.log(Defaults["VALUE"]+10)
Defaults_temp["CLNO"] = np.log(Defaults["CLNO"]+10)
Defaults2 = Defaults_temp.copy()

II.2. Outlier Detection & Treatment¶

II.2.1. Creating outlier identification (Lower & Upper whiskers) function¶

In [58]:
Defaults2.columns
Out[58]:
Index(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC'],
      dtype='object')
In [59]:
# Checking Outliers in dataset
col_names = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC']

col_names = list(col_names)

fig, ax = plt.subplots(len(col_names), figsize=(8,50))

for i, col_val in enumerate(col_names):

    sns.boxplot(y=Defaults2[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()

II.2.2. Treating the outliers¶

In [60]:
col = ['BAD','REASON','JOB']
df_X = Defaults2.drop(col, axis= 1)
df_Y = df[col]
In [61]:
def treat_outlier(x):              #Outlier treatment
    # taking the 5th, 25th, 75th and 95th percentiles of the column
    q5 = np.percentile(x, 5)
    q25 = np.percentile(x, 25)
    q75 = np.percentile(x, 75)
    q95 = np.percentile(x, 95)
    # calculating the IQR range
    IQR = q75 - q25
    # calculating the minimum and maximum thresholds (whiskers)
    lower_bound = q25 - (1.5 * IQR)
    upper_bound = q75 + (1.5 * IQR)
    # capping outliers: values above the upper whisker are set to the 95th
    # percentile, values below the lower whisker to the 5th percentile
    return x.apply(lambda y: q95 if y > upper_bound else y).apply(lambda y: q5 if y < lower_bound else y)
In [62]:
for i in df_X:
    df_X[i]=treat_outlier(df_X[i])
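As a self-contained check of the capping rule applied above, the same logic is re-created here under the hypothetical name `cap_outliers` and run on toy data with one extreme value.

```python
# Self-contained re-creation of the capping rule used by treat_outlier above:
# values beyond Q3 + 1.5*IQR are capped at the 95th percentile, and values
# below Q1 - 1.5*IQR are raised to the 5th percentile.
import numpy as np
import pandas as pd

def cap_outliers(x: pd.Series) -> pd.Series:
    q5, q25, q75, q95 = np.percentile(x, [5, 25, 75, 95])
    iqr = q75 - q25
    lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    return x.apply(lambda y: q95 if y > upper else (q5 if y < lower else y))

s = pd.Series(list(range(1, 100)) + [10_000])  # one extreme high value
capped = cap_outliers(s)
print(s.max(), capped.max())  # the 10,000 outlier is pulled back below the whisker
```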
In [63]:
df = pd.concat([df_X, df_Y], axis = 1)
In [64]:
from sklearn.impute import SimpleImputer

# Identify columns with missing values
columns_with_missing = df.columns[df.isnull().any()]

# Impute numerical columns with the median
numerical_cols = df.select_dtypes(include='number').columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

# Impute categorical columns with mode
categorical_cols = df.select_dtypes(include='object').columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])


# Verify if there are still any missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")
Remaining missing values: 0

The data has now been prepared and is ready for the next step, model building. We will encode the categorical variables and then proceed to build the models.

# INSIGHTS

  1. Loan Approval and Defaults: The dataset comprises 5,960 observations, with a 20% default rate. The average approved loan amount is ~18,608, and the maximum is 89,900. Clients with relatively high loan amounts seem to have repaid successfully.

  2. Mortgage and Property Values: The average amount due on existing mortgages is ~73,761 (maximum 399,550), while the average current property value is ~101,776 (maximum 855,909). Higher current property values are associated with a higher default rate.

  3. Debt-to-Income Ratio: The average debt-to-income ratio is 33.78, within a favorable range. Higher debt-to-income ratios are associated with a higher default rate.

  4. Credit Lines and Enquiries: The average number of existing credit lines is 21. Higher numbers of derogatory remarks, delinquent credit lines, and credit inquiries are associated with a higher default rate.

  5. Reasons for Loan and Job Types: Debt consolidation is the most common reason for a loan and accounts for the highest number of repaid loans. Clients in the "Other" job category have the highest default rate.

  6. Distribution and Outliers: Several variables are not normally distributed and exhibit right skewness, indicating the presence of outliers.

  7. Correlations: MORTDUE (amount due on existing mortgage) is highly correlated with VALUE (current property value).

  8. Job Types and Loan Amounts: Self-employed clients have the highest average loan amount, while sales professionals have the least.

  9. Debt-to-Income Ratio and Job Types: Sales professionals have the highest average debt-to-income ratio.

  10. Number of Credit Lines and Job Types: Sales professionals have the highest average number of existing credit lines.

# RECOMMENDATIONS

  • Focus on offering moderate to high loan amounts cautiously, considering the higher default rate.
  • Evaluate property values carefully, especially for clients seeking loans against higher property values.
  • While the average is favorable, scrutinize clients with higher debt-to-income ratios more thoroughly.
  • Assess clients with multiple derogatory remarks, delinquent credit lines, and credit inquiries more cautiously.
  • Investigate clients in the "Others" job category more rigorously. Consider offering targeted solutions for debt consolidation.
  • Employ robust statistical methods and outlier detection techniques during analysis to ensure accurate modeling.
  • Consider this strong correlation when assessing a client's financial situation. It might indicate potential refinancing opportunities.
  • Tailor loan offerings based on the client's job type. Self-employed clients might require more customized solutions.
  • Provide financial counseling or advice to sales professionals to manage their debt effectively.
  • Monitor credit usage and provide guidance on managing multiple credit lines responsibly.

# KEY POINTS

Prioritize a thorough evaluation of clients with characteristics linked to higher default rates. Tailor loan offerings based on job types, considering the observed differences in loan amounts and default rates. Implement stringent risk assessment for clients in the "Other" job category. Leverage the correlation between MORTDUE and VALUE for potential refinancing opportunities.


II.3. Prepare dataset for modeling¶

In [65]:
df.dtypes
Out[65]:
LOAN       float64
MORTDUE    float64
VALUE      float64
YOJ        float64
DEROG      float64
DELINQ     float64
CLAGE      float64
NINQ       float64
CLNO       float64
DEBTINC    float64
PROBINC    float64
BAD          int64
REASON      object
JOB         object
dtype: object
In [66]:
from sklearn.impute import SimpleImputer

# Identify columns with missing values
columns_with_missing = df.columns[df.isnull().any()]

# Impute numerical columns with the median
numerical_cols = df.select_dtypes(include='number').columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

# Impute categorical columns with mode
categorical_cols = df.select_dtypes(include='object').columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

# Verify if there are still any missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")
Remaining missing values: 0
In [67]:
df.head()
Out[67]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON JOB
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 HomeImp Other
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 HomeImp Other
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 HomeImp Other
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 DebtCon Other
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 HomeImp Office
In [68]:
#Let's identify the categorical features
df.columns[df.dtypes == object]
Out[68]:
Index(['REASON', 'JOB'], dtype='object')
  • REASON and JOB are the categorical variables
  • The categorical variables REASON and JOB will be transformed using one-hot encoding.
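The one-hot encoding step can be sketched with `pd.get_dummies` on a toy frame; with `drop_first=True` (as used in the modeling section), each categorical keeps k-1 indicator columns to avoid redundancy.

```python
# Sketch of one-hot encoding REASON and JOB on a toy frame (values drawn from
# the categories seen in the dataset): drop_first=True drops the first category
# of each column alphabetically, keeping k-1 indicators per categorical.
import pandas as pd

toy = pd.DataFrame({
    "REASON": ["HomeImp", "DebtCon", "DebtCon"],
    "JOB": ["Other", "Office", "Mgr"],
})
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['REASON_HomeImp', 'JOB_Office', 'JOB_Other']
```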

III. Logistic Regression¶

We want to predict clients who are likely to default on their loan. Before we proceed to build a model, we'll have to encode categorical features. We'll split the data into train and test to be able to evaluate the model that we build on the train data.

III.1. Data preparation for Logistic Regression model¶

In [69]:
X = df.drop(["BAD"], axis=1)
Y = df["BAD"]

# adding constant
X = sm.add_constant(X)

X = pd.get_dummies(X, drop_first=True)

III.2. Train-Test-Split for Logistic Regression model¶

In [70]:
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [71]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (4172, 18)
Shape of test set :  (1788, 18)
Percentage of classes in training set:
0   0.80417
1   0.19583
Name: BAD, dtype: float64
Percentage of classes in test set:
0   0.79195
1   0.20805
Name: BAD, dtype: float64

III.3. Building Logistic Regression model¶

In [72]:
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)

print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                    BAD   No. Observations:                 4172
Model:                          Logit   Df Residuals:                     4154
Method:                           MLE   Df Model:                           17
Date:                Fri, 22 Dec 2023   Pseudo R-squ.:                  0.2415
Time:                        16:45:11   Log-Likelihood:                -1565.1
converged:                       True   LL-Null:                       -2063.3
Covariance Type:            nonrobust   LLR p-value:                5.014e-201
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const              3.6234      1.271      2.851      0.004       1.132       6.115
LOAN              -0.1184      0.024     -4.925      0.000      -0.165      -0.071
MORTDUE           -0.8414      0.240     -3.501      0.000      -1.313      -0.370
VALUE             -0.0365      0.143     -0.256      0.798      -0.316       0.243
YOJ               -0.0777      0.136     -0.573      0.567      -0.343       0.188
DEROG              0.6250      0.059     10.656      0.000       0.510       0.740
DELINQ             0.7335      0.046     16.061      0.000       0.644       0.823
CLAGE             -0.0047      0.001     -7.182      0.000      -0.006      -0.003
NINQ               0.1667      0.025      6.597      0.000       0.117       0.216
CLNO              -0.6487      0.156     -4.167      0.000      -0.954      -0.344
DEBTINC            0.0931      0.009     10.572      0.000       0.076       0.110
PROBINC            0.0002   4.52e-05      3.610      0.000    7.46e-05       0.000
REASON_HomeImp     0.1210      0.106      1.144      0.253      -0.086       0.328
JOB_Office        -0.5862      0.182     -3.219      0.001      -0.943      -0.229
JOB_Other         -0.0562      0.141     -0.399      0.690      -0.333       0.220
JOB_ProfExe        0.0522      0.164      0.318      0.751      -0.270       0.374
JOB_Sales          0.6965      0.330      2.109      0.035       0.049       1.344
JOB_Self           0.5637      0.265      2.129      0.033       0.045       1.083
==================================================================================

Model evaluation criterion

The model can make two kinds of wrong predictions: predicting a customer will not default when, in reality, the customer defaults on their loan obligation (a false negative), or predicting a customer will default when, in reality, the customer repays the loan (a false positive).

Which case is more important?

  • Both cases are important:

  • If we predict that a customer will not default and the default does occur, the bank loses the loaned funds and has to bear additional recovery costs.

  • If we predict that a customer will default but they would actually repay, the bank rejects a creditworthy applicant, losing business and potentially damaging its brand equity.

How to reduce the losses?

  • We will look at the F1 score, which we want to maximize: the greater the F1 score, the better the balance between minimizing false negatives and minimizing false positives.
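F1 is the harmonic mean of precision and recall, so it is high only when both are high; a quick numeric sketch:

```python
# F1 = 2PR / (P + R): the harmonic mean punishes imbalance between
# precision (P) and recall (R) -- one low value drags the score down.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.76, 0.35), 3))  # imbalanced pair -> 0.479
print(round(f1(0.55, 0.55), 3))  # balanced pair   -> 0.55
```

Note that the balanced pair scores higher even though its best metric is lower than 0.76.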

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

  • The model_performance_classification_statsmodels function will be used to check the model performance of models.
  • The confusion_matrix_statsmodels function will be used to plot the confusion matrix.
In [73]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # classify an observation as 1 when its predicted probability exceeds the threshold
    pred = (model.predict(predictors) > threshold).astype(int)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [74]:
# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Observations

  • Negative coefficients indicate that the probability of a client defaulting on their loan decreases as the corresponding attribute value increases.

  • Positive coefficients indicate that the probability of a client defaulting increases as the corresponding attribute value increases.

  • p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.

  • But these variables might contain multicollinearity, which will affect the p-values.

  • We will have to remove multicollinearity from the data to get reliable coefficients and p-values.

  • There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).
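The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others; the notebook computes it later with statsmodels' variance_inflation_factor, but the definition can be sketched directly (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, i):
    """VIF of column i: regress it on the remaining columns, use 1 / (1 - R^2)."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Third column is nearly a copy of the first
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])

print(vif(X, 1))  # independent feature: VIF close to 1
print(vif(X, 2))  # near-duplicate feature: VIF well above 10
```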

III.4. Checking Logistic Regression model performance on the training set¶

In [75]:
import sklearn.metrics
In [76]:
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
Out[76]:
Accuracy Recall Precision F1
0 0.85139 0.35373 0.75853 0.48247

Let us check multicollinearity of variables¶

In [77]:
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
In [78]:
checking_vif(X_train)
Out[78]:
feature VIF
0 const 750.69990
1 LOAN 1.28254
2 MORTDUE 2.29975
3 VALUE 2.49333
4 YOJ 1.07623
5 DEROG 1.08064
6 DELINQ 1.07127
7 CLAGE 1.13811
8 NINQ 1.09370
9 CLNO 1.28833
10 DEBTINC 1.15834
11 PROBINC 1.24661
12 REASON_HomeImp 1.14039
13 JOB_Office 1.88896
14 JOB_Other 2.56109
15 JOB_ProfExe 2.14889
16 JOB_Sales 1.12629
17 JOB_Self 1.26024

Observations

  • None of the numerical variables show moderate or high multicollinearity (all VIF values are well below 5).
  • We will ignore the VIF of the constant and the dummy variables.

Let us check variables with p-value to drop them if possible¶

  • We will drop the predictor variables having a p-value greater than 0.05 as they do not significantly impact the target variable.
  • But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
  • Instead, we will do the following:
    • Build a model, check the p-values of the variables, and drop the column with the highest p-value.
    • Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
    • Repeat the above two steps till there are no columns with p-value > 0.05.

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

In [79]:
# initial list of columns
cols = X_train.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]

    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'LOAN', 'MORTDUE', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'PROBINC', 'JOB_Office', 'JOB_Sales', 'JOB_Self']
In [80]:
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
In [81]:
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                    BAD   No. Observations:                 4172
Model:                          Logit   Df Residuals:                     4159
Method:                           MLE   Df Model:                           12
Date:                Fri, 22 Dec 2023   Pseudo R-squ.:                  0.2409
Time:                        16:45:11   Log-Likelihood:                -1566.2
converged:                       True   LL-Null:                       -2063.3
Covariance Type:            nonrobust   LLR p-value:                3.256e-205
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2490      0.813      3.998      0.000       1.656       4.842
LOAN          -0.1285      0.022     -5.851      0.000      -0.171      -0.085
MORTDUE       -0.8471      0.196     -4.313      0.000      -1.232      -0.462
DEROG          0.6265      0.058     10.736      0.000       0.512       0.741
DELINQ         0.7309      0.046     16.057      0.000       0.642       0.820
CLAGE         -0.0047      0.001     -7.246      0.000      -0.006      -0.003
NINQ           0.1645      0.025      6.573      0.000       0.115       0.214
CLNO          -0.6584      0.151     -4.348      0.000      -0.955      -0.362
DEBTINC        0.0933      0.009     10.688      0.000       0.076       0.110
PROBINC        0.0002   4.53e-05      3.694      0.000    7.86e-05       0.000
JOB_Office    -0.5569      0.144     -3.880      0.000      -0.838      -0.276
JOB_Sales      0.6921      0.307      2.257      0.024       0.091       1.293
JOB_Self       0.6223      0.236      2.636      0.008       0.160       1.085
==============================================================================
In [82]:
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
Training performance:
Out[82]:
Accuracy Recall Precision F1
0 0.85163 0.35373 0.76053 0.48287
  • All the remaining variables have p-value < 0.05.
  • So we can say that lg1 is the best model for making inferences.
  • The performance on the training data is essentially the same as before dropping the variables with high p-values.

Coefficient interpretations¶

  • Coefficients of LOAN, MORTDUE, CLAGE, CLNO, and JOB_Office are negative: an increase in these leads to a decrease in the chances of a customer defaulting on their loan.
  • Coefficients of DEROG, DELINQ, NINQ, DEBTINC, PROBINC, JOB_Sales, and JOB_Self are positive: an increase in these leads to an increase in the chances of a customer defaulting on their loan.

III.5. Checking Logistic Regression model performance on the testing set¶

In [83]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test)
In [84]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg1, X_test1, y_test
)

print("Test performance:")
log_reg_model_test_perf
Test performance:
Out[84]:
Accuracy Recall Precision F1
0 0.82998 0.29301 0.72667 0.41762

III.6. Converting coefficients to odds¶

  • The coefficients of the logistic regression model are in terms of log(odds); to find the odds, we take the exponential of the coefficients.
  • Therefore, odds = exp(b)
  • The percentage change in odds is given as: change in odds% = (exp(b) - 1) * 100
In [85]:
# converting coefficients to odds
odds = np.exp(lg1.params)

# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
Out[85]:
const LOAN MORTDUE DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC JOB_Office JOB_Sales JOB_Self
Odds 25.76339 0.87945 0.42864 1.87097 2.07704 0.99529 1.17879 0.51767 1.09780 1.00017 0.57299 1.99799 1.86326
Change_odd% 2476.33890 -12.05525 -57.13617 87.09705 107.70447 -0.47108 17.87866 -48.23295 9.77966 0.01673 -42.70095 99.79913 86.32564

Checking model performance on the training set

In [86]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
In [87]:
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
    lg1, X_train1, y_train
)
log_reg_model_train_perf
Training performance:
Out[87]:
Accuracy Recall Precision F1
0 0.85163 0.35373 0.76053 0.48287

ROC-AUC

  • ROC-AUC on training set
In [88]:
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
  • The Logistic Regression model gives comparable performance on the training and test sets, i.e., it generalizes well.
  • A ROC-AUC score of 0.81 on the training set is quite good.

Model Performance Improvement

  • Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve

In [89]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.19138084315264084
In [90]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
In [91]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
Out[91]:
Accuracy Recall Precision F1
0 0.76103 0.68543 0.43077 0.52905
  • Recall has increased significantly as compared to the previous model.
  • As the threshold decreases, recall keeps increasing while precision decreases; pushing recall alone is not the goal, so we need to choose an optimal balance between recall and precision.
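The recall/precision tradeoff against the threshold can be seen directly by sweeping the cutoff over a set of scores (synthetic labels and scores here, purely illustrative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
# Scores centered near 0.75 for positives and 0.25 for negatives
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0, 1)

# Lowering the threshold flags more observations as positive:
# recall rises while precision falls
for t in (0.5, 0.3, 0.2):
    pred = (y_score > t).astype(int)
    print(t, round(recall_score(y_true, pred), 2),
          round(precision_score(y_true, pred), 2))
```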

Let's check the performance on the test set

In [92]:
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
In [93]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
In [94]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
Out[94]:
Accuracy Recall Precision F1
0 0.75951 0.67742 0.44840 0.53961

Let's use Precision-Recall curve and see if we can find a better threshold

In [95]:
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
  • At 0.3 threshold we get a balanced precision and recall.
In [96]:
# setting the threshold
optimal_threshold_curve = 0.3

Checking model performance on training set

In [97]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train, threshold=optimal_threshold_curve)
In [98]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[98]:
Accuracy Recall Precision F1
0 0.82814 0.53611 0.56443 0.54991
  • Model performance has improved as compared to our initial model.
  • Model has given a balanced performance in terms of precision and recall.

Let's check the performance on the test set

In [99]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
In [100]:
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
Out[100]:
Accuracy Recall Precision F1
0 0.81823 0.48656 0.57460 0.52693

III.7. Logistic Regression model performance summary¶

In [101]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.19 Threshold",
    "Logistic Regression-0.30 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[101]:
Logistic Regression-default Threshold Logistic Regression-0.19 Threshold Logistic Regression-0.30 Threshold
Accuracy 0.85163 0.76103 0.82814
Recall 0.35373 0.68543 0.53611
Precision 0.76053 0.43077 0.56443
F1 0.48287 0.52905 0.54991
In [102]:
# test performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.19 Threshold",
    "Logistic Regression-0.30 Threshold",
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[102]:
Logistic Regression-default Threshold Logistic Regression-0.19 Threshold Logistic Regression-0.30 Threshold
Accuracy 0.82998 0.75951 0.81823
Recall 0.29301 0.67742 0.48656
Precision 0.72667 0.44840 0.57460
F1 0.41762 0.53961 0.52693

III.8. Observations from Logistic Regression model¶

Logistic Regression at the 0.19 threshold (chosen via the AUC-ROC curve) has the highest recall on the training set and generalizes well to the test set. The prediction accuracy is also reasonably high.

IV. Support Vector Machines¶

Support Vector Machines (SVMs) are a powerful set of supervised learning methods that can be used effectively for both classification and regression tasks. We will apply them to our classification problem and see how they perform compared to the other models.
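One caveat worth noting: the RBF kernel is distance-based, so features on very different scales can dominate it. A common practice is to wrap the SVC in a scaling pipeline; a sketch on synthetic data (not the exact setup used below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded loan features
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Fitting the scaler inside the pipeline keeps test-set statistics
# out of the training step
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 2))
```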

IV.1. Data Preparation for Support vector Machines¶

In [103]:
df.head()
Out[103]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON JOB
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 HomeImp Other
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 HomeImp Other
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 HomeImp Other
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 DebtCon Other
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 HomeImp Office
In [104]:
#Let's identify the categorical features
df.columns[df.dtypes == object]
Out[104]:
Index(['REASON', 'JOB'], dtype='object')
In [105]:
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
Out[105]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON_DebtCon REASON_HomeImp JOB_Mgr JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 1 0 0 0 1 0 0 0
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 0 1 0 1 0 0 0 0

IV.2. Train-Test-Split for SVM model¶

In [107]:
Y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y )
In [108]:
x_train.shape, x_test.shape
Out[108]:
((4768, 19), (1192, 19))
In [109]:
print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_test: ", y_test.shape)
Shape of x_train:  (4768, 19)
Shape of y_train:  (4768,)
Shape of x_test:  (1192, 19)
Shape of y_test:  (1192,)

IV.3. Building Support Vector Machines model¶

In [110]:
def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(x_train))))
        print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(x_train))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(x_train))))

        res = cross_val_score(clf, x_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

    else:
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(x_test))))
        print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(x_test))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(x_test))))
In [111]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [112]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [113]:
from sklearn import svm
clf = svm.SVC(kernel='rbf', gamma='auto')
clf.fit(x_train, y_train)
Out[113]:
SVC(gamma='auto')

IV.4. Checking Support Vector Machines model performance on the training set¶

In [114]:
confusion_matrix_sklearn(clf, x_train, y_train)
In [115]:
svm_perf_train = model_performance_classification_sklearn(
    clf, x_train, y_train
)
svm_perf_train
Out[115]:
Accuracy Recall Precision F1
0 0.96099 0.92429 0.88520 0.90432

IV.5. Checking model performance on the testing set¶

In [116]:
confusion_matrix_sklearn(clf, x_test, y_test)
In [117]:
svm_perf_test = model_performance_classification_sklearn(
    clf, x_test, y_test
)
svm_perf_test
Out[117]:
Accuracy Recall Precision F1
0 0.83305 0.40756 0.62581 0.49364

IV.6. Support Vector Machines Models Observations¶

  • Recall on the training set is high, but performance drops sharply on the test set, which indicates the model is overfitting the training data.
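A common remedy for such a train/test gap (not applied in this notebook, sketched here on synthetic data) is cross-validated tuning of the SVC's C and gamma, scoring on recall since the project prioritizes catching defaulters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    scoring="recall",  # optimize for catching the positive (default) class
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

Smaller C and smaller gamma both regularize the decision boundary, trading training recall for better generalization.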

V. Decision Tree¶

V.1. Data preparation for decision tree model¶

  • We want to predict which clients are likely to default on their loan.
  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll split the data into train and test to be able to evaluate the model that we build on the train data.
In [118]:
df.head()
Out[118]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON JOB
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 HomeImp Other
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 HomeImp Other
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 HomeImp Other
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 DebtCon Other
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 HomeImp Office
In [119]:
#Let's identify the categorical features
df.columns[df.dtypes == object]
Out[119]:
Index(['REASON', 'JOB'], dtype='object')
  • REASON and JOB are the categorical variables
  • The categorical variables REASON and JOB will be transformed using one-hot encoding
In [120]:
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
Out[120]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON_DebtCon REASON_HomeImp JOB_Mgr JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 1 0 0 0 1 0 0 0
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 0 1 0 1 0 0 0 0

V.2. Train-Test Split for decision tree model¶

In [122]:
y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
In [123]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=1, shuffle = True )
In [124]:
X_train.shape, X_test.shape
Out[124]:
((4172, 19), (1788, 19))
In [125]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (4172, 19)
Shape of test set :  (1788, 19)
Percentage of classes in training set:
0   0.80417
1   0.19583
Name: BAD, dtype: float64
Percentage of classes in test set:
0   0.79195
1   0.20805
Name: BAD, dtype: float64

V.3. Building Decision Tree Model¶

In [126]:
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
Out[126]:
DecisionTreeClassifier(random_state=1)

First, let's create functions to calculate the different metrics and plot the confusion matrix, so that we don't have to repeat the same code for each model.

  • The model_performance_classification_sklearn function will be used to compute performance metrics for each model.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [127]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [128]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

V.4. Checking Decision Tree model performance on training set¶

In [129]:
confusion_matrix_sklearn(model, X_train, y_train)
In [130]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[130]:
Accuracy Recall Precision F1
0 1.00000 1.00000 1.00000 1.00000
  • 0 errors on the training set: every sample has been classified correctly.
  • The model has performed very well on the training set.
  • As we know, a decision tree with no restrictions will keep growing until it classifies every training point correctly, learning all the patterns (including noise) in the training set.
  • Let's check the performance on the test data to see if the model is overfitting.
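This unrestricted-growth behaviour is easy to reproduce in isolation. A minimal sketch on synthetic data (using sklearn's make_classification, not the loan dataset) shows a default DecisionTreeClassifier reaching perfect training accuracy while the test score lags behind:

```python
# Sketch: an unrestricted decision tree memorizes its training set.
# Synthetic data only; feature counts and sizes are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# no max_depth / max_leaf_nodes restrictions: the tree grows until pure leaves
tree_unrestricted = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, tree_unrestricted.predict(X_tr))
test_acc = accuracy_score(y_te, tree_unrestricted.predict(X_te))

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"depth: {tree_unrestricted.tree_.max_depth}, "
      f"nodes: {tree_unrestricted.tree_.node_count}")
```

The gap between the two accuracies is the overfitting we expect to see on the loan data as well.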

V.5. Checking Decision Tree model performance on test set¶

In [131]:
confusion_matrix_sklearn(model, X_test, y_test)
In [132]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[132]:
Accuracy Recall Precision F1
0 0.86577 0.61290 0.70370 0.65517
  • The decision tree model is overfitting the data as expected and is not able to generalize well on the test set.
  • We will have to prune the decision tree.

Before pruning the tree let's check the important features.

Plotting the feature importance of each variable

In [133]:
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • DEBTINC is the most important feature followed by PROBINC.
  • Now let's prune the tree to see if we can reduce the complexity.

V.6. Pruning decision tree model¶

V.6.1. Pre-pruning¶

In [134]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations (F1, not accuracy)
f1_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[134]:
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)

V.6.2. Checking performance on the training set¶

In [136]:
confusion_matrix_sklearn(estimator, X_train, y_train)
In [137]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[137]:
Accuracy Recall Precision F1
0 0.86290 0.79437 0.61633 0.69412

V.6.3. Checking performance on the test set¶

In [138]:
confusion_matrix_sklearn(estimator, X_test, y_test)
In [139]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[139]:
Accuracy Recall Precision F1
0 0.85235 0.74194 0.62162 0.67647

V.6.4. Visualization of the Decision Tree¶

In [140]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [141]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- DEBTINC <= 34.82
|   |--- DELINQ <= 1.50
|   |   |--- NINQ <= 4.50
|   |   |   |--- CLNO <= 2.80
|   |   |   |   |--- weights: [22.38, 28.09] class: 1
|   |   |   |--- CLNO >  2.80
|   |   |   |   |--- weights: [889.74, 107.24] class: 0
|   |   |--- NINQ >  4.50
|   |   |   |--- CLNO <= 3.01
|   |   |   |   |--- weights: [9.33, 0.00] class: 0
|   |   |   |--- CLNO >  3.01
|   |   |   |   |--- weights: [7.46, 25.53] class: 1
|   |--- DELINQ >  1.50
|   |   |--- CLNO <= 3.60
|   |   |   |--- DELINQ <= 2.50
|   |   |   |   |--- weights: [18.65, 15.32] class: 0
|   |   |   |--- DELINQ >  2.50
|   |   |   |   |--- weights: [1.24, 33.19] class: 1
|   |   |--- CLNO >  3.60
|   |   |   |--- DEROG <= 3.00
|   |   |   |   |--- weights: [23.01, 0.00] class: 0
|   |   |   |--- DEROG >  3.00
|   |   |   |   |--- weights: [0.00, 7.66] class: 1
|--- DEBTINC >  34.82
|   |--- DEBTINC <= 34.82
|   |   |--- DELINQ <= 0.50
|   |   |   |--- CLAGE <= 178.10
|   |   |   |   |--- weights: [93.26, 561.71] class: 1
|   |   |   |--- CLAGE >  178.10
|   |   |   |   |--- weights: [85.18, 155.75] class: 1
|   |   |--- DELINQ >  0.50
|   |   |   |--- CLAGE <= 390.62
|   |   |   |   |--- weights: [36.68, 668.95] class: 1
|   |   |   |--- CLAGE >  390.62
|   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |--- DEBTINC >  34.82
|   |   |--- DEBTINC <= 43.75
|   |   |   |--- CLAGE <= 202.30
|   |   |   |   |--- weights: [525.39, 268.09] class: 0
|   |   |   |--- CLAGE >  202.30
|   |   |   |   |--- weights: [360.00, 35.75] class: 0
|   |   |--- DEBTINC >  43.75
|   |   |   |--- CLAGE <= 285.54
|   |   |   |   |--- weights: [4.97, 176.17] class: 1
|   |   |   |--- CLAGE >  285.54
|   |   |   |   |--- weights: [6.84, 2.55] class: 0

Plotting the feature importance of each variable

In [142]:
# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations from decision tree

  • We can see that the tree has become simpler and its rules are readable.
  • The model's performance has generalized: training and test scores are now comparable.
  • We observe that the most important features are:
    • DEBTINC
    • CLAGE
    • DELINQ
    • CLNO
    • NINQ
    • DEROG

The rules obtained from the decision tree can be interpreted as:

  • The rules show that DEBTINC plays a key role in identifying if a client will default or not.

If we want more complex rules, we can traverse deeper into the tree.
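To make the interpretation concrete, the dominant splits of the printed tree can be paraphrased as a plain Python predicate. This is a loose, hand-written reading aid, not the model itself: the thresholds (34.82, 43.75, 2.5) are copied from the export_text output above, and DEBTINC/DELINQ here are the transformed values used in this notebook.

```python
# Loose paraphrase of the dominant splits in the pruned tree above.
# Thresholds copied from the export_text output; this is a reading aid,
# not a replacement for the fitted model.
def rule_of_thumb(debtinc, delinq):
    if debtinc <= 34.82:
        # low debt-to-income: risky mainly with repeated delinquencies
        return 1 if delinq > 2.5 else 0
    # the moderate band (34.82, 43.75] mostly predicts repayment;
    # very high debt-to-income predicts default
    return 1 if debtinc > 43.75 else 0
```

Such a hand-readable rule is exactly the kind of justification the problem statement requires for adverse decisions.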

V.6.5. Cost complexity pruning¶

In [143]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [144]:
pd.DataFrame(path)
Out[144]:
ccp_alphas impurities
0 0.00000 -0.00000
1 0.00000 -0.00000
2 0.00000 -0.00000
3 0.00000 -0.00000
4 0.00000 -0.00000
... ... ...
253 0.00548 0.27333
254 0.00766 0.28099
255 0.00776 0.28875
256 0.03667 0.32542
257 0.08729 0.50000

258 rows × 2 columns

In [145]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [146]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08729057011909772
In [147]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

V.6.5.1. F1 Score vs alpha for training and testing sets¶

In [148]:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [149]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [150]:
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00038111933333653467,
                       class_weight='balanced', random_state=1)

V.6.5.2. Checking performance on the training set¶

In [151]:
confusion_matrix_sklearn(best_model, X_train, y_train)
In [152]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[152]:
Accuracy Recall Precision F1
0 0.95901 1.00000 0.82692 0.90526

V.6.5.3. Checking performance on the test set¶

In [153]:
confusion_matrix_sklearn(best_model, X_test, y_test)
In [154]:
decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_test
Out[154]:
Accuracy Recall Precision F1
0 0.87528 0.75538 0.68039 0.71592

Observations

  • After post-pruning, the decision tree's performance has generalized across the training and test sets.
  • We are getting high recall with this model, but the gap between recall and precision has increased.
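The recall/precision gap noted above is the usual trade-off for a classifier tuned to catch defaulters. A small illustration with made-up confusion-matrix counts (not this model's actual outputs) shows how catching more defaulters (fewer false negatives) raises recall while the extra false alarms pull precision down:

```python
# Illustration of the precision/recall trade-off with hypothetical
# confusion-matrix counts (tp/fp/fn values are invented for the example).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted defaulters, how many truly default
    recall = tp / (tp + fn)     # of true defaulters, how many we caught
    return precision, recall

# conservative classifier: few false alarms, many missed defaulters
p1, r1 = precision_recall(tp=60, fp=10, fn=40)
# aggressive classifier: most defaulters caught, more false alarms
p2, r2 = precision_recall(tp=90, fp=40, fn=10)

print(f"conservative: precision={p1:.2f} recall={r1:.2f}")
print(f"aggressive:   precision={p2:.2f} recall={r2:.2f}")
```

For a bank, the aggressive operating point is often preferable: a missed defaulter (false negative) costs far more than an extra manual review of a good applicant (false positive).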
In [155]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [156]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- DEBTINC <= 34.82
|   |--- DELINQ <= 1.50
|   |   |--- NINQ <= 4.50
|   |   |   |--- CLNO <= 2.80
|   |   |   |   |--- DEBTINC <= 22.28
|   |   |   |   |   |--- weights: [13.06, 0.00] class: 0
|   |   |   |   |--- DEBTINC >  22.28
|   |   |   |   |   |--- PROBINC <= 915.67
|   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |--- PROBINC >  915.67
|   |   |   |   |   |   |--- YOJ <= 3.43
|   |   |   |   |   |   |   |--- DEROG <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.62, 28.09] class: 1
|   |   |   |   |   |   |   |--- DEROG >  2.50
|   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |   |   |--- YOJ >  3.43
|   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |--- CLNO >  2.80
|   |   |   |   |--- LOAN <= 17.25
|   |   |   |   |   |--- CLAGE <= 172.84
|   |   |   |   |   |   |--- CLAGE <= 126.95
|   |   |   |   |   |   |   |--- VALUE <= 10.86
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- VALUE >  10.86
|   |   |   |   |   |   |   |   |--- weights: [6.22, 0.00] class: 0
|   |   |   |   |   |   |--- CLAGE >  126.95
|   |   |   |   |   |   |   |--- weights: [1.24, 15.32] class: 1
|   |   |   |   |   |--- CLAGE >  172.84
|   |   |   |   |   |   |--- LOAN <= 17.13
|   |   |   |   |   |   |   |--- weights: [19.27, 0.00] class: 0
|   |   |   |   |   |   |--- LOAN >  17.13
|   |   |   |   |   |   |   |--- VALUE <= 10.79
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- VALUE >  10.79
|   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |--- LOAN >  17.25
|   |   |   |   |   |--- DEBTINC <= 2.94
|   |   |   |   |   |   |--- weights: [0.00, 5.11] class: 1
|   |   |   |   |   |--- DEBTINC >  2.94
|   |   |   |   |   |   |--- DEROG <= 8.00
|   |   |   |   |   |   |   |--- YOJ <= 2.74
|   |   |   |   |   |   |   |   |--- MORTDUE <= 3.72
|   |   |   |   |   |   |   |   |   |--- LOAN <= 19.91
|   |   |   |   |   |   |   |   |   |   |--- PROBINC <= 731.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- PROBINC >  731.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.77] class: 1
|   |   |   |   |   |   |   |   |   |--- LOAN >  19.91
|   |   |   |   |   |   |   |   |   |   |--- DEBTINC <= 31.79
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.79, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- DEBTINC >  31.79
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |--- MORTDUE >  3.72
|   |   |   |   |   |   |   |   |   |--- VALUE <= 11.81
|   |   |   |   |   |   |   |   |   |   |--- CLAGE <= 338.47
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- CLAGE >  338.47
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- VALUE >  11.81
|   |   |   |   |   |   |   |   |   |   |--- JOB_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- JOB_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- YOJ >  2.74
|   |   |   |   |   |   |   |   |--- DEBTINC <= 33.61
|   |   |   |   |   |   |   |   |   |--- LOAN <= 20.66
|   |   |   |   |   |   |   |   |   |   |--- LOAN <= 20.59
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- LOAN >  20.59
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- LOAN >  20.66
|   |   |   |   |   |   |   |   |   |   |--- JOB_Mgr <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [266.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- JOB_Mgr >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |--- DEBTINC >  33.61
|   |   |   |   |   |   |   |   |   |--- DEBTINC <= 33.70
|   |   |   |   |   |   |   |   |   |   |--- YOJ <= 2.89
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 7.66] class: 1
|   |   |   |   |   |   |   |   |   |   |--- YOJ >  2.89
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- DEBTINC >  33.70
|   |   |   |   |   |   |   |   |   |   |--- DEBTINC <= 34.54
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [47.88, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- DEBTINC >  34.54
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |--- DEROG >  8.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |--- NINQ >  4.50
|   |   |   |--- CLNO <= 3.01
|   |   |   |   |--- weights: [9.33, 0.00] class: 0
|   |   |   |--- CLNO >  3.01
|   |   |   |   |--- YOJ <= 3.16
|   |   |   |   |   |--- CLAGE <= 91.58
|   |   |   |   |   |   |--- VALUE <= 11.41
|   |   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |   |   |--- VALUE >  11.41
|   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- CLAGE >  91.58
|   |   |   |   |   |   |--- weights: [0.62, 22.98] class: 1
|   |   |   |   |--- YOJ >  3.16
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |--- DELINQ >  1.50
|   |   |--- CLNO <= 3.60
|   |   |   |--- DELINQ <= 2.50
|   |   |   |   |--- NINQ <= 0.50
|   |   |   |   |   |--- weights: [12.44, 0.00] class: 0
|   |   |   |   |--- NINQ >  0.50
|   |   |   |   |   |--- REASON_HomeImp <= 0.50
|   |   |   |   |   |   |--- CLAGE <= 271.60
|   |   |   |   |   |   |   |--- weights: [4.97, 0.00] class: 0
|   |   |   |   |   |   |--- CLAGE >  271.60
|   |   |   |   |   |   |   |--- weights: [0.62, 2.55] class: 1
|   |   |   |   |   |--- REASON_HomeImp >  0.50
|   |   |   |   |   |   |--- weights: [0.62, 12.77] class: 1
|   |   |   |--- DELINQ >  2.50
|   |   |   |   |--- DEBTINC <= 34.57
|   |   |   |   |   |--- weights: [0.00, 33.19] class: 1
|   |   |   |   |--- DEBTINC >  34.57
|   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |--- CLNO >  3.60
|   |   |   |--- DEROG <= 3.00
|   |   |   |   |--- weights: [23.01, 0.00] class: 0
|   |   |   |--- DEROG >  3.00
|   |   |   |   |--- weights: [0.00, 7.66] class: 1
|--- DEBTINC >  34.82
|   |--- DEBTINC <= 34.82
|   |   |--- DELINQ <= 0.50
|   |   |   |--- CLAGE <= 178.10
|   |   |   |   |--- YOJ <= 3.54
|   |   |   |   |   |--- YOJ <= 2.72
|   |   |   |   |   |   |--- YOJ <= 2.31
|   |   |   |   |   |   |   |--- MORTDUE <= 3.62
|   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |   |   |   |--- MORTDUE >  3.62
|   |   |   |   |   |   |   |   |--- weights: [1.87, 7.66] class: 1
|   |   |   |   |   |   |--- YOJ >  2.31
|   |   |   |   |   |   |   |--- MORTDUE <= 3.78
|   |   |   |   |   |   |   |   |--- weights: [3.73, 97.02] class: 1
|   |   |   |   |   |   |   |--- MORTDUE >  3.78
|   |   |   |   |   |   |   |   |--- MORTDUE <= 3.81
|   |   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- MORTDUE >  3.81
|   |   |   |   |   |   |   |   |   |--- DEROG <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- CLAGE <= 175.40
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 7
|   |   |   |   |   |   |   |   |   |   |--- CLAGE >  175.40
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- DEROG >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 40.85] class: 1
|   |   |   |   |   |--- YOJ >  2.72
|   |   |   |   |   |   |--- CLNO <= 2.80
|   |   |   |   |   |   |   |--- weights: [1.87, 51.06] class: 1
|   |   |   |   |   |   |--- CLNO >  2.80
|   |   |   |   |   |   |   |--- MORTDUE <= 4.38
|   |   |   |   |   |   |   |   |--- LOAN <= 17.16
|   |   |   |   |   |   |   |   |   |--- weights: [3.11, 48.51] class: 1
|   |   |   |   |   |   |   |   |--- LOAN >  17.16
|   |   |   |   |   |   |   |   |   |--- MORTDUE <= 3.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- MORTDUE >  3.25
|   |   |   |   |   |   |   |   |   |   |--- LOAN <= 17.57
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- LOAN >  17.57
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |--- MORTDUE >  4.38
|   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |--- YOJ >  3.54
|   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |--- CLAGE >  178.10
|   |   |   |   |--- YOJ <= 2.73
|   |   |   |   |   |--- JOB_Office <= 0.50
|   |   |   |   |   |   |--- CLAGE <= 316.03
|   |   |   |   |   |   |   |--- CLAGE <= 275.98
|   |   |   |   |   |   |   |   |--- YOJ <= 2.52
|   |   |   |   |   |   |   |   |   |--- weights: [5.60, 45.96] class: 1
|   |   |   |   |   |   |   |   |--- YOJ >  2.52
|   |   |   |   |   |   |   |   |   |--- CLNO <= 3.35
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CLNO >  3.35
|   |   |   |   |   |   |   |   |   |   |--- VALUE <= 10.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- VALUE >  10.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.49, 20.43] class: 1
|   |   |   |   |   |   |   |--- CLAGE >  275.98
|   |   |   |   |   |   |   |   |--- YOJ <= 2.48
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- YOJ >  2.48
|   |   |   |   |   |   |   |   |   |--- LOAN <= 21.72
|   |   |   |   |   |   |   |   |   |   |--- NINQ <= 2.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.11] class: 1
|   |   |   |   |   |   |   |   |   |   |--- NINQ >  2.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- LOAN >  21.72
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |   |   |   |   |--- CLAGE >  316.03
|   |   |   |   |   |   |   |--- weights: [0.62, 22.98] class: 1
|   |   |   |   |   |--- JOB_Office >  0.50
|   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |--- YOJ >  2.73
|   |   |   |   |   |--- YOJ <= 3.07
|   |   |   |   |   |   |--- VALUE <= 10.44
|   |   |   |   |   |   |   |--- weights: [0.00, 7.66] class: 1
|   |   |   |   |   |   |--- VALUE >  10.44
|   |   |   |   |   |   |   |--- VALUE <= 12.24
|   |   |   |   |   |   |   |   |--- JOB_Mgr <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [28.60, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- JOB_Mgr >  0.50
|   |   |   |   |   |   |   |   |   |--- CLAGE <= 259.30
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CLAGE >  259.30
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- VALUE >  12.24
|   |   |   |   |   |   |   |   |--- VALUE <= 12.29
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |   |--- VALUE >  12.29
|   |   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |   |--- YOJ >  3.07
|   |   |   |   |   |   |--- CLNO <= 3.51
|   |   |   |   |   |   |   |--- CLAGE <= 219.65
|   |   |   |   |   |   |   |   |--- weights: [9.33, 0.00] class: 0
|   |   |   |   |   |   |   |--- CLAGE >  219.65
|   |   |   |   |   |   |   |   |--- CLAGE <= 272.90
|   |   |   |   |   |   |   |   |   |--- YOJ <= 3.54
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.24, 12.77] class: 1
|   |   |   |   |   |   |   |   |   |--- YOJ >  3.54
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CLAGE >  272.90
|   |   |   |   |   |   |   |   |   |--- weights: [4.97, 0.00] class: 0
|   |   |   |   |   |   |--- CLNO >  3.51
|   |   |   |   |   |   |   |--- JOB_Other <= 0.50
|   |   |   |   |   |   |   |   |--- NINQ <= 4.50
|   |   |   |   |   |   |   |   |   |--- CLAGE <= 181.60
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CLAGE >  181.60
|   |   |   |   |   |   |   |   |   |   |--- YOJ <= 3.51
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- YOJ >  3.51
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.32] class: 1
|   |   |   |   |   |   |   |   |--- NINQ >  4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |   |   |   |   |   |--- JOB_Other >  0.50
|   |   |   |   |   |   |   |   |--- VALUE <= 11.97
|   |   |   |   |   |   |   |   |   |--- weights: [5.60, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- VALUE >  11.97
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.11] class: 1
|   |   |--- DELINQ >  0.50
|   |   |   |--- CLAGE <= 390.62
|   |   |   |   |--- MORTDUE <= 4.64
|   |   |   |   |   |--- weights: [34.82, 666.40] class: 1
|   |   |   |   |--- MORTDUE >  4.64
|   |   |   |   |   |--- CLAGE <= 183.18
|   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- CLAGE >  183.18
|   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |--- CLAGE >  390.62
|   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |--- DEBTINC >  34.82
|   |   |--- DEBTINC <= 43.75
|   |   |   |--- CLAGE <= 202.30
|   |   |   |   |--- VALUE <= 11.44
|   |   |   |   |   |--- DEROG <= 1.50
|   |   |   |   |   |   |--- LOAN <= 21.45
|   |   |   |   |   |   |   |--- DEBTINC <= 40.19
|   |   |   |   |   |   |   |   |--- YOJ <= 2.35
|   |   |   |   |   |   |   |   |   |--- weights: [16.79, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- YOJ >  2.35
|   |   |   |   |   |   |   |   |   |--- PROBINC <= 912.38
|   |   |   |   |   |   |   |   |   |   |--- CLNO <= 3.28
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 12.77] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CLNO >  3.28
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- PROBINC >  912.38
|   |   |   |   |   |   |   |   |   |   |--- VALUE <= 11.23
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 8
|   |   |   |   |   |   |   |   |   |   |--- VALUE >  11.23
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 8
|   |   |   |   |   |   |   |--- DEBTINC >  40.19
|   |   |   |   |   |   |   |   |--- YOJ <= 2.92
|   |   |   |   |   |   |   |   |   |--- CLAGE <= 170.44
|   |   |   |   |   |   |   |   |   |   |--- NINQ <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- NINQ >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 7.66] class: 1
|   |   |   |   |   |   |   |   |   |--- CLAGE >  170.44
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 17.87] class: 1
|   |   |   |   |   |   |   |   |--- YOJ >  2.92
|   |   |   |   |   |   |   |   |   |--- JOB_ProfExe <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- YOJ <= 3.42
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- YOJ >  3.42
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.49, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- JOB_ProfExe >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |   |   |--- LOAN >  21.45
|   |   |   |   |   |   |   |--- LOAN <= 24.31
|   |   |   |   |   |   |   |   |--- CLAGE <= 53.21
|   |   |   |   |   |   |   |   |   |--- CLNO <= 2.86
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CLNO >  2.86
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.11] class: 1
|   |   |   |   |   |   |   |   |--- CLAGE >  53.21
|   |   |   |   |   |   |   |   |   |--- VALUE <= 11.43
|   |   |   |   |   |   |   |   |   |   |--- CLAGE <= 200.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [70.26, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CLAGE >  200.84
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 2.55] class: 1
|   |   |   |   |   |   |   |   |   |--- VALUE >  11.43
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- LOAN >  24.31
|   |   |   |   |   |   |   |   |--- CLNO <= 3.53
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.21] class: 1
|   |   |   |   |   |   |   |   |--- CLNO >  3.53
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |--- DEROG >  1.50
|   |   |   |   |   |   |--- MORTDUE <= 4.04
|   |   |   |   |   |   |   |--- weights: [2.49, 25.53] class: 1
|   |   |   |   |   |   |--- MORTDUE >  4.04
|   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |--- VALUE >  11.44
|   |   |   |   |   |--- LOAN <= 25.00
|   |   |   |   |   |   |--- DELINQ <= 5.00
|   |   |   |   |   |   |   |--- CLAGE <= 173.52
|   |   |   |   |   |   |   |   |--- DEROG <= 0.50
|   |   |   |   |   |   |   |   |   |--- YOJ <= 2.86
|   |   |   |   |   |   |   |   |   |   |--- CLNO <= 3.31
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- CLNO >  3.31
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- YOJ >  2.86
|   |   |   |   |   |   |   |   |   |   |--- JOB_Mgr <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- JOB_Mgr >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |--- DEROG >  0.50
|   |   |   |   |   |   |   |   |   |--- JOB_ProfExe <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- LOAN <= 22.76
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.62, 10.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- LOAN >  22.76
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- JOB_ProfExe >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.35, 0.00] class: 0
|   |   |   |   |   |   |   |--- CLAGE >  173.52
|   |   |   |   |   |   |   |   |--- weights: [72.12, 0.00] class: 0
|   |   |   |   |   |   |--- DELINQ >  5.00
|   |   |   |   |   |   |   |--- weights: [0.00, 7.66] class: 1
|   |   |   |   |   |--- LOAN >  25.00
|   |   |   |   |   |   |--- JOB_Self <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 12.77] class: 1
|   |   |   |   |   |   |--- JOB_Self >  0.50
|   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |--- CLAGE >  202.30
|   |   |   |   |--- DELINQ <= 1.50
|   |   |   |   |   |--- CLAGE <= 795.72
|   |   |   |   |   |   |--- VALUE <= 12.55
|   |   |   |   |   |   |   |--- JOB_Sales <= 0.50
|   |   |   |   |   |   |   |   |--- YOJ <= 2.35
|   |   |   |   |   |   |   |   |   |--- LOAN <= 22.14
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.41, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- LOAN >  22.14
|   |   |   |   |   |   |   |   |   |   |--- LOAN <= 22.75
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- LOAN >  22.75
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.60, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- YOJ >  2.35
|   |   |   |   |   |   |   |   |   |--- MORTDUE <= 4.54
|   |   |   |   |   |   |   |   |   |   |--- weights: [300.93, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- MORTDUE >  4.54
|   |   |   |   |   |   |   |   |   |   |--- YOJ <= 2.94
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.79, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- YOJ >  2.94
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |--- JOB_Sales >  0.50
|   |   |   |   |   |   |   |   |--- CLNO <= 3.85
|   |   |   |   |   |   |   |   |   |--- weights: [5.60, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CLNO >  3.85
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |--- VALUE >  12.55
|   |   |   |   |   |   |   |--- LOAN <= 24.90
|   |   |   |   |   |   |   |   |--- weights: [3.11, 0.00] class: 0
|   |   |   |   |   |   |   |--- LOAN >  24.90
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- CLAGE >  795.72
|   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |--- DELINQ >  1.50
|   |   |   |   |   |--- YOJ <= 2.52
|   |   |   |   |   |   |--- DELINQ <= 3.50
|   |   |   |   |   |   |   |--- weights: [5.60, 0.00] class: 0
|   |   |   |   |   |   |--- DELINQ >  3.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |--- YOJ >  2.52
|   |   |   |   |   |   |--- VALUE <= 11.82
|   |   |   |   |   |   |   |--- weights: [2.49, 17.87] class: 1
|   |   |   |   |   |   |--- VALUE >  11.82
|   |   |   |   |   |   |   |--- weights: [1.24, 0.00] class: 0
|   |   |--- DEBTINC >  43.75
|   |   |   |--- CLAGE <= 285.54
|   |   |   |   |--- DEBTINC <= 44.57
|   |   |   |   |   |--- CLAGE <= 238.43
|   |   |   |   |   |   |--- weights: [2.49, 20.43] class: 1
|   |   |   |   |   |--- CLAGE >  238.43
|   |   |   |   |   |   |--- weights: [1.87, 0.00] class: 0
|   |   |   |   |--- DEBTINC >  44.57
|   |   |   |   |   |--- weights: [0.62, 155.75] class: 1
|   |   |   |--- CLAGE >  285.54
|   |   |   |   |--- DEBTINC <= 46.62
|   |   |   |   |   |--- weights: [6.84, 0.00] class: 0
|   |   |   |   |--- DEBTINC >  46.62
|   |   |   |   |   |--- weights: [0.00, 2.55] class: 1

Plotting the feature importance of each variable

In [157]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="Blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations from tree

  • The post-pruned tree is considerably more complex than the pre-pruned tree.
  • The feature importances are the same as those obtained from the pre-pruned tree.

V.6.6. Comparing Decision Tree Models¶

In [158]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[158]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.00000 0.86290 0.95901
Recall 1.00000 0.79437 1.00000
Precision 1.00000 0.61633 0.82692
F1 1.00000 0.69412 0.90526
In [159]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[159]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.86577 0.85235 0.87528
Recall 0.61290 0.74194 0.75538
Precision 0.70370 0.62162 0.68039
F1 0.65517 0.67647 0.71592

Observations

  • The decision tree model with default parameters overfits the training data and does not generalize well.
  • The pre-pruned tree gives a generalized performance with balanced precision and recall.
  • The post-pruned tree gives a higher F1 score than the other models, but the gap between its precision and recall is large.
  • Using the pre-pruned decision tree model, the bank will be able to maintain a balance between resources and brand equity.

VI. Random forest¶

VI.1. Data preparation for random forest model¶

In [160]:
df.head()
Out[160]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON JOB
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 HomeImp Other
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 HomeImp Other
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 HomeImp Other
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 DebtCon Other
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 HomeImp Office
In [161]:
#Let's identify the categorical features
df.columns[df.dtypes == object]
Out[161]:
Index(['REASON', 'JOB'], dtype='object')
In [162]:
# apply get_dummies function
df_encoded = pd.get_dummies(df, columns=['REASON','JOB'])
df_encoded.head()
Out[162]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC PROBINC BAD REASON_DebtCon REASON_HomeImp JOB_Mgr JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 17.10492 3.56105 10.57221 3.02042 0.00000 0.00000 94.36667 1.00000 2.94444 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
1 17.10492 4.03347 11.13327 2.83321 0.00000 2.00000 121.83333 0.00000 3.17805 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
2 17.10492 3.28316 9.72376 2.63906 0.00000 0.00000 149.46667 1.00000 2.99573 34.81826 1975.70831 1 0 1 0 0 1 0 0 0
3 17.10492 3.99604 11.39915 2.83321 0.00000 0.00000 173.46667 1.00000 3.40120 34.81826 1975.70831 1 1 0 0 0 1 0 0 0
4 17.10492 4.20526 11.62634 2.56495 0.00000 0.00000 93.33333 0.00000 3.17805 34.81826 1975.70831 0 0 1 0 1 0 0 0 0

VI.2. Train-Test-Split for Random Forest model¶

In [164]:
Y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y )
In [165]:
x_train.shape, x_test.shape
Out[165]:
((4768, 19), (1192, 19))

Oversampling Using SMOTE

In [166]:
sm = SMOTE(random_state=12)
x_train_r, y_train_r = sm.fit_resample(x_train, y_train)
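SMOTE creates synthetic minority-class rows rather than duplicating existing ones: it picks a minority point, finds a nearby minority neighbour, and interpolates between them. A minimal numpy sketch of that core idea (not the imblearn implementation used above):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(5, 2))  # hypothetical minority-class points

def smote_like_sample(X, rng):
    # pick a random minority point, find its nearest minority neighbour,
    # then interpolate a synthetic point somewhere between the two
    i = rng.integers(len(X))
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                  # exclude the point itself
    j = int(np.argmin(dist))
    lam = rng.random()                # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = smote_like_sample(minority, rng)
```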

VI.3. Building random forest model¶

In [167]:
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf.fit(x_train_r, y_train_r)
Out[167]:
RandomForestClassifier(random_state=42)

VI.4. Checking random forest model performance on the training set¶

In [168]:
confusion_matrix_sklearn(clf_rf, x_train, y_train)
In [169]:
rf_perf_train = model_performance_classification_sklearn(
    clf_rf, x_train, y_train
)
rf_perf_train
Out[169]:
Accuracy Recall Precision F1
0 1.00000 1.00000 1.00000 1.00000

VI.5. Checking random forest model performance on the testing set¶

In [170]:
confusion_matrix_sklearn(clf_rf, x_test, y_test)
In [171]:
rf_perf_test = model_performance_classification_sklearn(
    clf_rf, x_test, y_test
)
rf_perf_test
Out[171]:
Accuracy Recall Precision F1
0 0.91107 0.76471 0.78448 0.77447
  • The model is overfitting the training data (perfect training scores). However, it has the highest test recall and accuracy so far.

VII. Ensemble Models - Model Building and Hyperparameter Tuning¶

VII.1. Data preparation for ensemble models¶

In [172]:
y = df_encoded.BAD
X = df_encoded.drop("BAD", axis = 1)

VII.2. Train-Test-Split data for ensemble models¶

In [173]:
# Splitting the data into train and test sets in 70:30 ratio
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, shuffle=True)
In [174]:
x_train.shape, x_test.shape
Out[174]:
((4172, 19), (1788, 19))

First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model. The model_performance_classification_statsmodels function will be used to check model performance, and the confusion_matrix_statsmodels function will be used to plot the confusion matrix.

In [175]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # flagging observations whose predicted probability exceeds the threshold as class 1
    pred = model.predict(predictors) > threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [176]:
# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

VII.2.1. Random Forest¶

VII.2.1.1. Building the Random Forest model¶

In [177]:
rf_classifier=RandomForestClassifier(random_state=42)
rf_classifier.fit(x_train,y_train)
Out[177]:
RandomForestClassifier(random_state=42)

VII.2.1.2. Checking the training performance of the Random Forest model¶

In [178]:
rf_classifier_model_train_perf = model_performance_classification_sklearn(rf_classifier, x_train,y_train)
print("Training performance \n",rf_classifier_model_train_perf)
Training performance 
    Accuracy  Recall  Precision      F1
0   1.00000 1.00000    1.00000 1.00000

VII.2.1.3. Checking the testing performance of the Random Forest model¶

In [179]:
rf_classifier_model_test_perf = model_performance_classification_sklearn(rf_classifier, x_test,y_test)
print("Testing performance \n",rf_classifier_model_test_perf)
Testing performance 
    Accuracy  Recall  Precision      F1
0   0.91163 0.68817    0.85906 0.76418

VII.2.1.4. Plotting the feature importance of each variable¶

In [180]:
print(pd.DataFrame(rf_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                   Imp
DEBTINC        0.18387
PROBINC        0.11402
CLAGE          0.09490
DELINQ         0.08407
VALUE          0.08254
MORTDUE        0.07885
LOAN           0.07853
CLNO           0.07212
YOJ            0.05547
DEROG          0.05260
NINQ           0.04228
REASON_HomeImp 0.00923
JOB_Office     0.00907
JOB_Other      0.00887
REASON_DebtCon 0.00868
JOB_ProfExe    0.00867
JOB_Mgr        0.00710
JOB_Sales      0.00529
JOB_Self       0.00383
In [181]:
feature_names = x_train.columns
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

VII.2.1.5. Observations about the Random Forest Classifier¶

  • The model is overfitting the training data (perfect training scores); its test recall (0.69) is noticeably lower than that of the resampled random forest.
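One common remedy for the gap between perfect training scores and lower test scores (not applied in this notebook) is to cap tree growth in the forest, e.g. via max_depth and min_samples_leaf; a hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# shallower trees with larger leaves trade a little training accuracy
# for lower variance, and hence less overfitting
rf = RandomForestClassifier(
    n_estimators=100, max_depth=5, min_samples_leaf=10, random_state=42
)
rf.fit(X, y)
train_acc = rf.score(X, y)  # typically no longer a perfect 1.0 once growth is capped
```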

VII.2.2. AdaBoost¶

VII.2.2.1. Building the AdaBoost model¶

In [182]:
from sklearn.ensemble import AdaBoostClassifier
In [183]:
ab_classifier=AdaBoostClassifier(n_estimators=100, learning_rate=0.2, random_state=4)
ab_classifier.fit(x_train,y_train)
Out[183]:
AdaBoostClassifier(learning_rate=0.2, n_estimators=100, random_state=4)

VII.2.2.2. Checking the training performance of the AdaBoost model¶

In [184]:
ab_classifier_model_train_perf = model_performance_classification_sklearn(ab_classifier, x_train,y_train)
print("Training performance \n",ab_classifier_model_train_perf)
Training performance 
    Accuracy  Recall  Precision      F1
0   0.89334 0.56671    0.83574 0.67542

VII.2.2.3. Checking the testing performance of the AdaBoost model¶

In [185]:
ab_classifier_model_test_perf = model_performance_classification_sklearn(ab_classifier, x_test,y_test)
print("Testing performance \n",ab_classifier_model_test_perf)
Testing performance 
    Accuracy  Recall  Precision      F1
0   0.88647 0.54032    0.86266 0.66446

VII.2.2.4. Plotting the feature importance of each variable¶

In [186]:
print(pd.DataFrame(ab_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                   Imp
DEBTINC        0.36000
DELINQ         0.14000
CLAGE          0.10000
CLNO           0.07000
PROBINC        0.07000
DEROG          0.06000
VALUE          0.04000
NINQ           0.04000
LOAN           0.03000
YOJ            0.03000
MORTDUE        0.02000
JOB_Office     0.02000
JOB_Sales      0.02000
REASON_DebtCon 0.00000
REASON_HomeImp 0.00000
JOB_Mgr        0.00000
JOB_Other      0.00000
JOB_ProfExe    0.00000
JOB_Self       0.00000
In [187]:
feature_names = x_train.columns
importances = ab_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

VII.2.3. Gradient Boosting Classifier¶

VII.2.3.1. Building the Gradient Boosting model¶

In [188]:
from sklearn.ensemble import GradientBoostingClassifier
In [189]:
gb_classifier=GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, random_state=4)
gb_classifier.fit(x_train,y_train)
Out[189]:
GradientBoostingClassifier(learning_rate=0.2, random_state=4)

VII.2.3.2. Checking the training performance of the Gradient Boosting model¶

In [190]:
gb_classifier_model_train_perf = model_performance_classification_sklearn(gb_classifier, x_train,y_train)
print("Training performance \n",gb_classifier_model_train_perf)
Training performance 
    Accuracy  Recall  Precision      F1
0   0.94535 0.78703    0.92253 0.84941

VII.2.3.3. Checking the testing performance of the Gradient Boosting model¶

In [191]:
gb_classifier_model_test_perf = model_performance_classification_sklearn(gb_classifier, x_test,y_test)
print("Testing performance \n",gb_classifier_model_test_perf)
Testing performance 
    Accuracy  Recall  Precision      F1
0   0.90940 0.66667    0.86713 0.75380

VII.2.3.4. Plotting the feature importance of each variable¶

In [192]:
print(pd.DataFrame(gb_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                   Imp
DEBTINC        0.31110
PROBINC        0.18286
DELINQ         0.13044
DEROG          0.07112
CLAGE          0.07053
CLNO           0.04807
VALUE          0.04442
MORTDUE        0.04226
LOAN           0.03072
YOJ            0.02787
NINQ           0.02290
JOB_Office     0.00406
JOB_Sales      0.00405
JOB_Other      0.00308
JOB_Mgr        0.00263
JOB_ProfExe    0.00180
REASON_HomeImp 0.00152
REASON_DebtCon 0.00057
JOB_Self       0.00000
In [193]:
feature_names = x_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show();

VII.2.4. XGBoost Classifier¶

VII.2.4.1. Building the XGBoost model¶

In [194]:
from xgboost import XGBClassifier
In [195]:
xgb_classifier=XGBClassifier(random_state=1, verbosity = 0)
xgb_classifier.fit(x_train,y_train)
Out[195]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)

VII.2.4.2. Checking the training performance of the XGBoost model¶

In [196]:
xgb_classifier_model_train_perf = model_performance_classification_sklearn(xgb_classifier, x_train, y_train)
print("Training performance \n",xgb_classifier_model_train_perf)
Training performance 
    Accuracy  Recall  Precision      F1
0   0.99832 0.99388    0.99754 0.99571

VII.2.4.3. Checking the testing performance of the XGBoost model¶

In [197]:
xgb_classifier_model_test_perf = model_performance_classification_sklearn(xgb_classifier, x_test, y_test)
print("Testing performance \n",xgb_classifier_model_test_perf)
Testing performance 
    Accuracy  Recall  Precision      F1
0   0.91387 0.69624    0.86333 0.77083

VII.2.4.4. Plotting the feature importance of each variable¶

In [198]:
print(pd.DataFrame(xgb_classifier.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                   Imp
DEBTINC        0.21867
PROBINC        0.14010
DELINQ         0.12172
DEROG          0.07227
JOB_Sales      0.05035
JOB_Self       0.04820
CLAGE          0.03957
JOB_Mgr        0.03616
NINQ           0.03152
JOB_Office     0.03132
CLNO           0.03057
YOJ            0.02938
VALUE          0.02908
MORTDUE        0.02804
LOAN           0.02752
REASON_DebtCon 0.02740
JOB_ProfExe    0.02386
JOB_Other      0.01427
REASON_HomeImp 0.00000
In [199]:
feature_names = x_train.columns
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show();

VIII. Comparing all models¶

In [200]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
        svm_perf_train.T, 
        rf_perf_train.T, 
        rf_classifier_model_train_perf.T,
        ab_classifier_model_train_perf.T, 
        gb_classifier_model_train_perf.T,
        xgb_classifier_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
    "Support Vector Machine",
    "Random Forest (resampled)",
    "Bagging", 
    "Ada Boost","Gradient Boost", "XG Boost",
]

# test set  performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
        
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
        
        svm_perf_test.T, 
        rf_perf_test.T, 
        rf_classifier_model_test_perf.T,
        ab_classifier_model_test_perf.T, 
        gb_classifier_model_test_perf.T,
        xgb_classifier_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
    "Support Vector Machine",
    "Random Forest (resampled)",
    "Bagging", 
    "Ada Boost","Gradient Boost", "XG Boost",
]
In [201]:
models_train_comp_df.T  
Out[201]:
Accuracy Recall Precision F1
Logistic Regression-default Threshold 0.85163 0.35373 0.76053 0.48287
Logistic Regression-0.37 Threshold 0.76103 0.68543 0.43077 0.52905
Logistic Regression-0.42 Threshold 0.82814 0.53611 0.56443 0.54991
Decision Tree sklearn 1.00000 1.00000 1.00000 1.00000
Decision Tree (Pre-Pruning) 0.86290 0.79437 0.61633 0.69412
Decision Tree (Post-Pruning) 0.95901 1.00000 0.82692 0.90526
Support Vector Machine 0.96099 0.92429 0.88520 0.90432
Random Forest (resampled) 1.00000 1.00000 1.00000 1.00000
Bagging 1.00000 1.00000 1.00000 1.00000
Ada Boost 0.89334 0.56671 0.83574 0.67542
Gradient Boost 0.94535 0.78703 0.92253 0.84941
XG Boost 0.99832 0.99388 0.99754 0.99571
In [202]:
models_test_comp_df.T
Out[202]:
Accuracy Recall Precision F1
Logistic Regression-default Threshold 0.82998 0.29301 0.72667 0.41762
Logistic Regression-0.37 Threshold 0.75951 0.67742 0.44840 0.53961
Logistic Regression-0.42 Threshold 0.81823 0.48656 0.57460 0.52693
Decision Tree sklearn 0.86577 0.61290 0.70370 0.65517
Decision Tree (Pre-Pruning) 0.85235 0.74194 0.62162 0.67647
Decision Tree (Post-Pruning) 0.87528 0.75538 0.68039 0.71592
Support Vector Machine 0.83305 0.40756 0.62581 0.49364
Random Forest (resampled) 0.91107 0.76471 0.78448 0.77447
Bagging 0.91163 0.68817 0.85906 0.76418
Ada Boost 0.88647 0.54032 0.86266 0.66446
Gradient Boost 0.90940 0.66667 0.86713 0.75380
XG Boost 0.91387 0.69624 0.86333 0.77083

MODEL EVALUATION AND CONCLUSION¶

Model Evaluation Criterion¶

The predictions made by the models translate as follows:

True positives (TP) are defaulters correctly predicted by the model.

False negatives (FN) are clients who default in reality but whom the model predicted as non-defaulters.

False positives (FP) are clients who do not default in reality but whom the model predicted as defaulters.

The model can make two kinds of wrong prediction: predicting a default when in reality the client does not default, and predicting no default when in reality the client does default.

Which case is more important?

If we predict that a client will default and in reality they do not, the bank suffers no real financial loss, only a lost lending opportunity.

If, on the other hand, we predict that a client will not default and they do default, the bank suffers a direct financial loss (a loan write-off), which hits the bottom line.

To reduce this loss, recall should be maximized (i.e., false negatives should be minimized): the higher the recall, the larger the share of actual defaulters the model identifies correctly.
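With hypothetical confusion-matrix counts, the two metrics compare as follows (using sklearn's recall_score and precision_score; the numbers are illustrative, not from our test set):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# hypothetical outcome: 100 true defaulters, 900 good clients;
# the model flags 80 of the defaulters (TP=80, FN=20) and
# wrongly flags 30 good clients (FP=30, TN=870)
y_true = np.array([1] * 100 + [0] * 900)
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 30 + [0] * 870)

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 80 / 100
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 80 / 110
print(recall, precision)  # 0.8 and ~0.727
```

Here every missed defaulter (FN) lowers recall directly, which is why recall is the metric to maximize for this problem.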

Conclusion¶

Random Forest (resampled), with the highest test recall of 0.76 and an accuracy of 91%, and the post-pruned decision tree, with a comparably high recall, are the models of interest.

Among all the models we tried, Random Forest (resampled) has the best recall on the test set (76%) with an overall accuracy of 91%.

Feature Importance For our Best Model - Random Forest (Resampled )¶

In [203]:
feature_names = x_train.columns
importances = clf_rf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='Blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

The three most important features that determine whether a client will repay or not are DEBTINC, DELINQ, and PROBINC.

DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income).

PROBINC: (current debt on mortgage) / (debt-to-income ratio).
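For concreteness, the debt-to-income ratio for a hypothetical borrower (figures assumed, not taken from the data):

```python
# hypothetical borrower: $1,500 in monthly debt payments, $5,000 gross monthly income
monthly_debt_payments = 1500.0
gross_monthly_income = 5000.0

# DEBTINC expressed as a percentage of gross monthly income
debtinc = 100 * monthly_debt_payments / gross_monthly_income
print(debtinc)  # 30.0
```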